{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Get Pythonic with the `collections` module"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## First day: your new data structure friends"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> The collections module implements specialized container datatypes providing alternatives to Python’s general purpose built-in containers, dict, list, set, and tuple. - [docs](https://docs.python.org/3.7/library/collections.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this Day 1 lesson we will use the most common `collections` types showing some easy to follow examples. Day 2 we will get some more practice using them on a movie data set. Day 3 we have you further practice using your own data. I am using Python 3.6 for this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import defaultdict, namedtuple, Counter, deque\n",
    "import csv\n",
    "import random\n",
    "from urllib.request import urlretrieve"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. `namedtuple`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A `namedtuple` is a convenient way to define a class without methods. This allows you to store `dict` like objects you can access by attributes. Let's first look at a classic `tuple`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "user = ('bob', 'coder')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The order is not really meaningful leading to ugly code to output the data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'bob is a coder'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f'{user[0]} is a {user[1]}'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's contrast that with a `namedtuple`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "User = namedtuple('User', 'name role')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can directly see that the object has a name and role:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "user = User(name='bob', role='coder')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'bob'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "user.name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'coder'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "user.role"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Making last string much more informational and elegant (f-strings helps too of course - now you know why we use Python >= 3.6)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'bob is a coder'"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f'{user.name} is a {user.role}'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Conclusion: use a `namedtuple` wherever you can! They are easy to implement and make your code more readable."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. `defaultdict`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I guess you are all too familiar with `KeyError` when using a `dict`, no? "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "users = {'bob': 'coder'}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "ename": "KeyError",
     "evalue": "'julian'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mKeyError\u001b[0m                                  Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-10-2d801425e069>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0musers\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'bob'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0musers\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'julian'\u001b[0m\u001b[0;34m]\u001b[0m  \u001b[0;31m# oops\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;31mKeyError\u001b[0m: 'julian'"
     ]
    }
   ],
   "source": [
    "users['bob']\n",
    "users['julian']  # oops"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can get around it with dict's get method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'coder'"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "users.get('bob')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "users.get('julian') is None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But what if you need to build up a collection though? Let's make a dict from the following list of tuples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('mike', 10),\n",
       " ('julian', 7),\n",
       " ('bob', 5),\n",
       " ('mike', 11),\n",
       " ('julian', 8),\n",
       " ('bob', 6)]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "challenges_done = [('mike', 10), ('julian', 7), ('bob', 5),\n",
    "                   ('mike', 11), ('julian', 8), ('bob', 6)]\n",
    "challenges_done"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "ename": "KeyError",
     "evalue": "'mike'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mKeyError\u001b[0m                                  Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-14-5d8074648e43>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0mchallenges\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      2\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mchallenge\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mchallenges_done\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m     \u001b[0mchallenges\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mchallenge\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;31mKeyError\u001b[0m: 'mike'"
     ]
    }
   ],
   "source": [
    "challenges = {}\n",
    "for name, challenge in challenges_done:\n",
    "    challenges[name].append(challenge)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "defaultdict(list, {'bob': [5, 6], 'julian': [7, 8], 'mike': [10, 11]})"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "challenges = defaultdict(list)\n",
    "for name, challenge in challenges_done:\n",
    "    challenges[name].append(challenge)\n",
    "\n",
    "challenges"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. `Counter`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of my favorites. Say you want to count the most common words in a text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Lorem', 'Ipsum', 'is', 'simply', 'dummy']"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "words = \"\"\"Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been \n",
    "the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and \n",
    "scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into \n",
    "electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of\n",
    "Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus\n",
    "PageMaker including versions of Lorem Ipsum\"\"\".split()\n",
    "words[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before getting to know `collections` I would has written something like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "the 6\n",
      "Lorem 4\n",
      "Ipsum 4\n",
      "of 4\n",
      "and 3\n"
     ]
    }
   ],
   "source": [
    "common_words = {}\n",
    "\n",
    "for word in words:\n",
    "    if word not in common_words:\n",
    "        common_words[word] = 0\n",
    "    common_words[word] += 1\n",
    "\n",
    "# sort dict by values descending and slice first 5 to get most common\n",
    "for k, v in sorted(common_words.items(),\n",
    "                   key=lambda x: x[1],\n",
    "                   reverse=True)[:5]:\n",
    "    print(k ,v)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now compare this to using `Counter` and its `most_common` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('the', 6), ('Lorem', 4), ('Ipsum', 4), ('of', 4), ('and', 3)]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Counter(words).most_common(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "WOW!\n",
    "\n",
    "When discovering this my mind was blown and it was a flag to myself that I had to study Python's [awesome standard library](https://docs.python.org/3/library/index.html) more, because it has these Pythonic idioms you can use right out of the box. They will make your code shorter (= less buggy) and more elegant!\n",
    "\n",
    "Another aha moment was [`deque`](https://pybit.es/collections-deque.html) which we will cover next."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. `deque`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Deques are a generalization of stacks and queues (the name is pronounced “deck” and is short for “double-ended queue”). Deques support thread-safe, memory efficient appends and pops from either side of the deque with approximately the same O(1) performance in either direction. - [docs](https://docs.python.org/3.7/library/collections.html)\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lists in Python are awesome, probably one of your goto data structure, because they are so widely used and convenient. \n",
    "\n",
    "However certain operatings (delete, insert) can get slow when your `list` grows - see [TimeComplexity](https://wiki.python.org/moin/TimeComplexity) for more details. \n",
    "\n",
    "If you need to add/remove at both ends of a collection, consider using a `deque`. Let's show this with a practical example using the `timeit` module to measure performance:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we create two 10 million integers with `range` storing one in a `list ` and the other in a `deque`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "lst = list(range(10000000))\n",
    "deq = deque(range(10000000))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's do some removing and inserting at random locations in the sequence, a `list` is slow at this because it needs to move all adjacent around ([Grokking Algorithms](https://pybit.es/grokking_algorithms.html) explains this really well). Here is where `deque` is a big win:  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "447 ms ± 45.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n",
      "83.7 µs ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
     ]
    }
   ],
   "source": [
    "def insert_and_delete(ds):\n",
    "    for _ in range(10):\n",
    "        index = random.choice(range(100))\n",
    "        ds.remove(index)\n",
    "        ds.insert(index, index)\n",
    "\n",
    "%timeit insert_and_delete(lst)\n",
    "\n",
    "%timeit insert_and_delete(deq)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So when performance matters you really want to explore the alternative data structures in the `collections` module. Another example of a performance win is `ChainMap`:\n",
    "\n",
    "> A ChainMap groups multiple dicts or other mappings together to create a single, updateable view. If no maps are specified, a single empty dictionary is provided so that a new chain always has at least one mapping. - [docs](https://docs.python.org/3.7/library/collections.html#collections.ChainMap)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Second day: practice using movie data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For [Code Challenge 13 - Highest Rated Movie Directors](https://pybit.es/codechallenge13.html) we used a nice movie data set. Let's import it here to see some of our newly learned `collections` data types in action."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's make a `defaultdict` of movies per directory (keys = directors, values = list of movies). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('movies.csv', <http.client.HTTPMessage at 0x10fee00f0>)"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movie_data = 'https://raw.githubusercontent.com/pybites/challenges/solutions/13/movie_metadata.csv'\n",
    "movies_csv = 'movies.csv'\n",
    "urlretrieve(movie_data, movies_csv)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A `namedtuple` is ideal here to describe a movie so we can access movie attributes (e.g. `movie.score`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "Movie = namedtuple('Movie', 'title year score')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need some CSV parsing as well here, no worries we got you covered:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_movies_by_director(data=movies_csv):\n",
    "    \"\"\"Extracts all movies from csv and stores them in a dictionary\n",
    "       where keys are directors, and values is a list of movies (named tuples)\"\"\"\n",
    "    directors = defaultdict(list)\n",
    "    with open(data, encoding='utf-8') as f:\n",
    "        for line in csv.DictReader(f):\n",
    "            try:\n",
    "                director = line['director_name']\n",
    "                movie = line['movie_title'].replace('\\xa0', '')\n",
    "                year = int(line['title_year'])\n",
    "                score = float(line['imdb_score'])\n",
    "            except ValueError:\n",
    "                continue\n",
    "\n",
    "            m = Movie(title=movie, year=year, score=score)\n",
    "            directors[director].append(m)\n",
    "\n",
    "    return directors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "directors = get_movies_by_director()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking up Christopher Nolan we get all his movies nicely stored in `Movie` `namedtuple` objects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Movie(title='The Dark Knight Rises', year=2012, score=8.5),\n",
       " Movie(title='The Dark Knight', year=2008, score=9.0),\n",
       " Movie(title='Interstellar', year=2014, score=8.6),\n",
       " Movie(title='Inception', year=2010, score=8.8),\n",
       " Movie(title='Batman Begins', year=2005, score=8.3),\n",
       " Movie(title='Insomnia', year=2002, score=7.2),\n",
       " Movie(title='The Prestige', year=2006, score=8.5),\n",
       " Movie(title='Memento', year=2000, score=8.5)]"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "directors['Christopher Nolan']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "You can do a lot with this data set and [we challenge you to do so](https://pybit.es/codechallenge13.html), but for this example let's just get the 5 directors with the most movies. \n",
    "\n",
    "Of course we don't want to re-invent the wheel so we use `Counter`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('Steven Spielberg', 26),\n",
       " ('Woody Allen', 22),\n",
       " ('Martin Scorsese', 20),\n",
       " ('Clint Eastwood', 20),\n",
       " ('Ridley Scott', 17)]"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cnt = Counter()\n",
    "for director, movies in directors.items():\n",
    "    cnt[director] += len(movies)\n",
    "\n",
    "cnt.most_common(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Third day: more practice on your own data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We challenge you to find your own data set and try to use the new `collections` data structures yourself. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Stuck at finding examples? We used `collections` quite a bit for our own [100 Days of Code](https://github.com/pybites/100DaysOfCode/blob/master/LOG.md):"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```$ python module_index.py |grep collections\n",
    "collections        | stdlib | 001, 021, 023, 034, 036, 042, 045, 055, 057, 063, 076, 084, 086, 095, 096```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Time to share what you've accomplished!\n",
    "\n",
    "Be sure to share your last couple of days work on Twitter or Facebook. Use the hashtag **#100DaysOfCode**. \n",
    "\n",
    "Here are [some examples](https://twitter.com/search?q=%23100DaysOfCode) to inspire you. Consider including [@talkpython](https://twitter.com/talkpython) and [@pybites](https://twitter.com/pybites) in your tweets.\n",
    "\n",
    "*See a mistake in these instructions? Please [submit a new issue](https://github.com/talkpython/100daysofcode-with-python-course/issues) or fix it and [submit a PR](https://github.com/talkpython/100daysofcode-with-python-course/pulls).*"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
